
feat: Add Ministral-3-3B VLM recipe with INT4 quantization and eval benchmarks#352

Open

titaiwangms wants to merge 7 commits into main from ministral-3b-text-export

Conversation

titaiwangms commented Apr 8, 2026

Summary

Adds a complete Olive recipe for exporting Ministral-3-3B-Instruct-2512 (Pixtral) as a 3-model VLM pipeline for ONNX Runtime GenAI:

  • Text decoder — Olive/ModelBuilder (GQA attention, YaRN RoPE, INT4 quantization)
  • Vision encoder — Mobius declarative export (Pixtral, dynamic H×W, 2D RoPE)
  • Embedding — Mobius export (token + image fusion)
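For orientation, here is a hedged sketch of how the resulting 3-model pipeline might be driven with the onnxruntime-genai Python API. The function name, prompt template, and paths are illustrative, and the API calls follow the published onnxruntime-genai multimodal examples, which may differ across versions — the recipe's actual inference.py is authoritative:

```python
def run_vlm(model_dir: str, image_path: str, prompt: str, max_length: int = 512) -> str:
    """Sketch of multimodal generation with ONNX Runtime GenAI.

    Assumes the onnxruntime-genai package; API names follow its
    published Python VLM examples and may differ across versions.
    """
    import onnxruntime_genai as og

    model = og.Model(model_dir)                      # reads genai_config.json
    processor = model.create_multimodal_processor()  # image preprocessing per processor_config.json
    images = og.Images.open(image_path)
    inputs = processor(prompt, images=images)        # fuses text tokens with image embeddings

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.set_inputs(inputs)
    generator = og.Generator(model, params)

    stream = processor.create_stream()
    pieces = []
    while not generator.is_done():
        generator.generate_next_token()
        pieces.append(stream.decode(generator.get_next_tokens()[0]))
    return "".join(pieces)
```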

Configurations

| Component | CUDA | CPU |
| --- | --- | --- |
| Text decoder | INT4 (MatMulNBits) | INT4 (MatMulNBits) |
| Vision encoder | FP16 | INT4 (MatMulNBits via Olive) |
| Embedding | FP16 | FP32 |

Benchmark Results (AI2D)

| Configuration | Accuracy | Samples | Latency (s/sample) | Gap vs PyTorch |
| --- | --- | --- | --- | --- |
| PyTorch FP32 (CPU) | 72.00% | 100 | 21.66 | baseline |
| PyTorch FP16 (CUDA) | 73.00% | 200 | 0.20 | baseline |
| ONNX CUDA (INT4 text + FP16 vision) | 71.65% | 200 | 0.11 | −1.35 pp |
| ONNX CPU (INT4 text + FP32 vision) | 71.13% | 194 | 26.86 | −0.87 pp |
| ONNX CPU (INT4 text + INT4 vision) | 69.07% | 194 | 33.28 | −2.93 pp |

All ONNX configs fall within the expected INT4 precision gap (<5 pp). ONNX on CUDA achieves roughly a 2× speedup over PyTorch CUDA FP16.

Key Features

  • _strip_unused_initializers() — removes dead weights from Olive INT4 output, reducing vision model from 1.7 GB → 220 MB (~90% size reduction)
  • _fix_gather_block_quantized() — preserves RoPE position cache through INT4 quantization by converting GatherBlockQuantized back to fp32 Gather
  • eval.py — AI2D benchmark tool comparing ONNX vs PyTorch baselines with per-sample logging
  • genai_config.json generation — auto-generates 3-model VLM runtime config with Pixtral image preprocessing

Dependencies

  • onnxruntime-genai PR #2076 — YaRN RoPE parity fixes (inv_freq, mscale, rope_theta fallback)
  • onnxruntime-genai PR #2077 — Mistral3 VLM support (C++ image processor, INT32 input_ids, context_length/max_length separation)
  • mobius PR #130 ("[Scanner]: Ran scanner and update README.md") — Mistral3 vision/embedding export support

Known Limitations

  • CPU INT4 vision: language drift — the INT4-quantized vision encoder occasionally produces embeddings that cause wrong-language responses (e.g., Chinese instead of English on challenge.jpg). FP16 vision (CUDA) does not exhibit this.
  • Single-image only — multi-image inputs not yet supported
  • FP8 checkpoint — default HF model uses FP8 weights; use -BF16 variant for PyTorch baselines

Copilot AI review requested due to automatic review settings April 8, 2026 23:15
titaiwangms marked this pull request as draft April 8, 2026 23:17

Copilot AI left a comment


Pull request overview

Adds a new “builtin” export + inference recipe for mistralai/Ministral-3-3B-Instruct-2512, targeting ONNX Runtime GenAI by exporting the text decoder via Olive/ModelBuilder and the vision/embedding pieces via Mobius, plus generating the runtime genai_config.json/processor_config.json.

Changes:

  • Introduces an end-to-end export/config-generation script (optimize.py) and a GenAI inference example (inference.py).
  • Adds Olive configs for CPU/mobile (INT4) and CUDA (FP16), along with recipe metadata (info.yml) and docs (README.md).
  • Adds custom patched modeling code under codes/ intended to support ONNX export.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py | Adds model config constants (currently with import-time HF loading). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt | Declares Olive + Mobius + torch/transformers dependencies. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md | Documents export workflow, output layout, and inference usage. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py | Implements export pipeline and GenAI config/tokenizer patching. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/info.yml | Registers builtin recipe metadata (keywords/EPs/devices/name). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py | Provides a CLI to run text-only and multimodal inference with ORT GenAI. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/text.json | Olive ModelBuilder config for FP16 CUDA decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/text.json | Olive ModelBuilder config for INT4 CPU/mobile decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/modeling_ministral3.py | Adds patched model components for ONNX-export-friendly behavior. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/__init__.py | Exposes Ministral3Model symbol. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/.gitignore | Ignores generated model artifacts and Olive cache. |


Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py
titaiwangms force-pushed the ministral-3b-text-export branch 2 times, most recently from 8058122 to 9d5c64b on April 9, 2026 21:31
titaiwangms changed the title from "Add Ministral-3-3B-Instruct-2512 recipe" to "Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export" on Apr 9, 2026
titaiwangms force-pushed the ministral-3b-text-export branch 2 times, most recently from b3f8592 to 5969770 on April 10, 2026 00:04
titaiwangms marked this pull request as ready for review April 10, 2026 21:58
titaiwangms force-pushed the ministral-3b-text-export branch from d3f7f6a to 5eb675d on April 14, 2026 20:23
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
titaiwangms force-pushed the ministral-3b-text-export branch 5 times, most recently from 9d5b928 to 7a914be on April 14, 2026 21:52
Complete Olive recipe for Ministral-3-3B-Instruct-2512 VLM using:
- Text decoder: Olive/ModelBuilder (INT4 for both CPU and CUDA)
- Vision encoder + embedding: Mobius (dynamo-free ONNX construction)
- Vision INT4 quantization: Olive post-export (CPU only)
- context_length=32768, Permute3D transform in processor_config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms force-pushed the ministral-3b-text-export branch from 7a914be to 1bdc231 on April 14, 2026 22:08
- Add _strip_unused_initializers to reduce INT4 model size (1.7GB→220MB)
- Add _fix_gather_block_quantized for RoPE cache preservation
- CUDA: INT4 text + FP16 vision (71.65% AI2D)
- CPU: INT4 text + INT4 vision (69.07% AI2D)
- Remove unnecessary genai_config overrides (trust ModelBuilder)
- Add comprehensive README with benchmark results
- Fix eval.py build_messages for Jinja sort compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Fixed
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Fixed
titaiwangms changed the title from "Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export" to "feat: Add Ministral-3-3B VLM recipe with INT4 quantization and eval benchmarks" on Apr 15, 2026
titaiwangms requested a review from Copilot April 15, 2026 22:22

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.



Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
titaiwangms and others added 5 commits April 15, 2026 22:39
- eval.py: Add explanatory comments to except-pass clauses
- optimize.py: Update docstring to match INT4 shipping config
- optimize.py: Document _get_hf_config MODEL_NAME usage
- optimize.py: Improve --dtype help text
- README.md: Fix precision labels (CUDA=INT4 text, CPU embedding=FP16)
- README.md: Remove stale FP32 embedding references

Note: eval.py dtype= kwarg is valid in transformers >=5.0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation

When --models-dir differs from the default (<config-dir>/models/),
text.json output_dir is hardcoded so exports go to the default location.
Copy the entire export tree to --models-dir after export so that
update_genai_config() and fix_tokenizer() find the files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CUDA graph capture is unsupported for VLMs with dynamic image sizes.
Set enable_cuda_graph=0 for ALL models (decoder, vision, embedding),
matching the Qwen VLM recipe convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Olive caches the full resolved config including absolute output_dir.
On re-runs with different --models-dir, the stale cache writes to the
old path, creating unexpected directories (e.g., ministral3-cpu-int4-test).
Clear the cache before each quantization run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- export_text_decoder: Load text.json as dict, override output_dir
- export_vision_and_embedding: Already accepts output_dir parameter
- quantize_vision_and_embedding: Load vision.json as dict, override
  model_path and output_dir
- Remove shutil.copytree post-export step from main()
- Remove .olive-cache clear (no longer needed)
- Pass models_dir through export_models() pipeline

This eliminates duplicate directories, copy overhead for multi-GB files,
and ghost directories from stale Olive cache paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
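The load-as-dict + override approach described in this commit can be sketched as follows. The `output_dir` key, the `text` subdirectory, and the helper name are illustrative assumptions, not the recipe's actual code — consult the recipe's text.json for the real Olive config schema:

```python
# Illustrative sketch of overriding an Olive config's output location
# in memory, instead of copying multi-GB export trees after the fact.
import json
import os
import tempfile
from pathlib import Path


def load_config_with_output_dir(config_path: str, models_dir: str) -> dict:
    """Load an Olive JSON config and redirect its output to models_dir.

    The "output_dir" key is an assumption; check the actual schema.
    """
    config = json.loads(Path(config_path).read_text())
    config["output_dir"] = str(Path(models_dir) / "text")
    return config


# Demo with a throwaway config file standing in for text.json.
tmp = tempfile.mkdtemp()
cfg_file = os.path.join(tmp, "text.json")
Path(cfg_file).write_text(json.dumps({"output_dir": "models/text"}))

config = load_config_with_output_dir(cfg_file, os.path.join(tmp, "exports"))
# With Olive installed, the overridden dict could then be run directly, e.g.:
#   from olive.workflows import run as olive_run
#   olive_run(config)
```

Because the override happens before Olive resolves the config, no stale absolute paths are cached and no post-export copy is needed.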